🤖 BMAD Agent Prompts — PUMA Project

Overview

All prompts follow RCOIF structure: Role · Context · Objective · Instructions · Format. CDD principle: every prompt anchors to the PUMA project context explicitly. See BMAD-Agent-Roster for agent definitions.


Agent 1: Research Analyst Prompts

1.1 Keshav Pass-1 Batch Scan

ROLE: You are a senior software engineering researcher specialising in LLM evaluation and empirical methods (EBSE).

CONTEXT: I am conducting an SLR for PUMA Project. The project (PUMA) evaluates local LLM agents for ICT project management tasks: issue triage (Jira SR dataset) and effort estimation (TAWOS dataset). I am in Phase 0 (literature review). My research question: "Do different LLM models and prompting strategies produce statistically significant differences in issue triage quality and effort estimation, evaluated on real PM datasets?"

OBJECTIVE: Perform a Keshav Pass-1 assessment of the following paper [insert abstract + title]. Determine if it is relevant to PUMA.

INSTRUCTIONS:
1. Evaluate using the 5 Cs: Category, Context, Correctness, Contributions, Clarity
2. Identify if the paper: (a) uses PM datasets (Jira, TAWOS, GitHub Issues), (b) evaluates LLMs empirically, (c) addresses reproducibility, (d) measures prompting strategy effects
3. Rate relevance 1–5 (5 = directly addresses PUMA's gap)
4. State: READ PASS 2 or ARCHIVE with one-sentence reason

FORMAT: Structured output with 5 Cs table, relevance score, and decision.

1.2 Gap Mapping Prompt (EGI + DRCA)

ROLE: Expert in systematic literature reviews for software engineering + AI in project management.

CONTEXT: [Paste evidence table with 10–15 papers, their methods, metrics, and limitations]

OBJECTIVE: Map the specific research gap that PUMA fills. PUMA's three proposed gaps: (1) reproducibility — only 5/18 published LLM-SE papers are executable; (2) no systematic prompting strategy comparison for PM tasks; (3) no environmental sustainability measurement per PM task.

INSTRUCTIONS:
1. For each gap, identify: which papers address it partially / not at all
2. Confirm or challenge each gap claim with evidence from the table
3. Identify any gaps I have missed that the evidence table suggests
4. Use DRCA: deconstruct the "benchmark" concept → reconstruct what a rigorous PM benchmark requires

FORMAT: Gap matrix (Gap × Paper), narrative synthesis (200 words max), and 3 bullet challenges to my current framing.

Agent 2: Product Manager Prompts

2.1 Sprint Planning (GTD + BMAD)

ROLE: Experienced Agile Product Manager familiar with academic research project constraints.

CONTEXT: PUMA Project. Current sprint: PEC2 (deadline 2026-04-08). Completed: PEC1 (Ch.1 + environment). Remaining objectives: OE3 (datasets), OE4 (triage module + Wilcoxon). Hardware: 16GB RAM laptop, Ollama, Python 3.11. Time available: 10 days, ~3h/day.

OBJECTIVE: Generate a realistic daily sprint plan for the 10 working days before PEC2 deadline.

INSTRUCTIONS:
1. Break OE3 and OE4 into atomic tasks (≤2h each)
2. Order by dependency (dataset → baseline → experiments → analysis → writing)
3. Flag tasks that are critical path items
4. Include one daily buffer task (low-priority) for overflow
5. Mark risks with ⚠️

FORMAT: Markdown table with Day | Task | Hours | Dependencies | Risk

2.2 Scope Validation

ROLE: Academic advisor reviewing PUMA Project scope.

CONTEXT: PUMA MVP = triage module (Stage 1) + estimation module (Stage 2). Constitution: reproducible, local-only, open-source, HITL, falsifiable.

OBJECTIVE: Assess if the following proposed addition is within PUMA Project scope: [describe feature]. 

INSTRUCTIONS:
1. Check against constitution articles 1–7
2. Estimate added time cost in days
3. Classify: IN-SCOPE MVP / IN-SCOPE OBJECTIVE / OUT-OF-SCOPE (document as future work)
4. If borderline, provide modified version that stays within scope

FORMAT: Decision + 3-sentence justification + alternative if rejected.

Agent 3: Architect Prompts

3.1 Architecture Decision Review (SDD + OpenSpec)

ROLE: Senior software architect specialising in LLM systems and reproducible ML pipelines.

CONTEXT: PUMA system architecture: Ollama (inference) + Python 3.11 + pandas + scikit-learn + scipy + CodeCarbon. Four prompting strategies: zero-shot, few-shot-3, few-shot-6, chain-of-thought. Two models: Llama 3.2 8B, Mistral 7B. Dataset: 200 stratified Jira SR issues (seed=42).

OBJECTIVE: Review the following architecture decision and identify hidden risks: [describe decision].

INSTRUCTIONS:
1. Identify risks related to: reproducibility, latency, statistical validity, scope creep
2. Check compliance with PUMA Constitution (esp. Articles 1–5)
3. Propose alternative if risks are unacceptable
4. State what would need to change in SP-Architecture-v1

FORMAT: Risk table (Risk | Severity | Mitigation) + recommendation (APPROVE / MODIFY / REJECT).

3.2 Spec Validation (Spec Kit)

ROLE: Technical reviewer using Spec Kit methodology.

CONTEXT: [Paste spec file content]

OBJECTIVE: Validate this spec against PUMA's constitution.md principles and SDD best practices.

INSTRUCTIONS:
1. Check: Does it define WHAT not HOW? 
2. Check: Are acceptance criteria measurable and falsifiable?
3. Check: Does it conflict with any constitution article?
4. Identify: missing edge cases, undefined behaviours, ambiguous terms
5. Rate completeness 1–5

FORMAT: Line-by-line annotations + overall score + list of required changes before implementation.

Agent 4: Developer Prompts

4.1 Triage Agent Implementation (CDD + Zero-Shot CoT)

ROLE: Senior Python developer with expertise in LLM integration and reproducible ML pipelines.

CONTEXT: PUMA triage module. Stack: Ollama REST API, Python 3.11, pandas, scikit-learn. Dataset: Jira SR (200 issues, 4 priority classes: Blocker/Critical/Major/Minor). Reproducibility: seed=42, temperature=0. CodeCarbon must wrap every inference call.

OBJECTIVE: Implement the zero-shot prompting strategy for issue triage classification.

INSTRUCTIONS:
1. Create function `triage_zero_shot(issue_text: str, model: str) -> str`
2. Use Ollama REST API (http://localhost:11434/api/generate)
3. Prompt: classify priority as one of [Blocker, Critical, Major, Minor]. Return only the label.
4. Wrap with CodeCarbon tracker
5. Log: model, strategy, latency_ms, input_tokens (estimated), output_label
6. Handle errors gracefully (timeout → log and skip)
7. Write docstring with example usage

FORMAT: Production-quality Python with type hints, docstrings, and error handling. No Jupyter cells — module-level functions only.

4.2 Wilcoxon Test Implementation

ROLE: Statistician + Python developer familiar with non-parametric tests for SE research.

CONTEXT: PUMA Stage 1 results. I have F1-macro scores per condition (model × strategy) across 200 issues. I need to test H1: "at least one configuration achieves F1-macro statistically superior to the heuristic baseline" (Wilcoxon signed-rank, α=0.05, two-sided, effect size r = Z/√N).

OBJECTIVE: Write a Python function performing the complete Wilcoxon analysis.

INSTRUCTIONS:
1. Function signature: `wilcoxon_analysis(baseline_scores: list[float], model_scores: list[float], condition_name: str) -> dict`
2. Use scipy.stats.wilcoxon (two-sided)
3. Return: statistic, p_value, effect_size_r, reject_h0 (bool), interpretation (string)
4. Print formatted result matching APA 7 reporting style: W(N) = X, p = Y, r = Z
5. Include Bonferroni correction if multiple conditions tested

FORMAT: Clean Python function with docstring + example call + formatted output string.

Agent 5: QA / Red Teamer Prompts

5.1 Statistical Red-Teaming

ROLE: Statistical critic reviewing empirical claims for a peer-reviewed SE paper.

CONTEXT: PUMA claims: "Configuration X (Llama 3.2 8B + CoT) achieves F1-macro = 0.68, significantly outperforming heuristic baseline (0.41), W = 847, p = 0.003, r = 0.19 (small effect)."

OBJECTIVE: Identify all statistical weaknesses in this claim.

INSTRUCTIONS:
1. Check: Is the sample size (200) sufficient for Wilcoxon with these effect sizes?
2. Check: Is F1-macro appropriate for class imbalance in Jira SR?
3. Check: Is the comparison (against heuristic) the right baseline?
4. Check: Does the small effect size (r=0.19) undermine practical significance?
5. Propose 2 additional analyses that would strengthen or weaken the claim
6. Write the "threats to validity" paragraph for this result

FORMAT: Numbered critique list + threats-to-validity paragraph (150 words max).

5.2 Reproducibility Verification Checklist

ROLE: Reproducibility auditor for computational research.

CONTEXT: PUMA repository structure: [describe].

OBJECTIVE: Generate a reproducibility verification script and checklist.

INSTRUCTIONS:
1. List all steps a new researcher must follow to reproduce results from scratch
2. Identify: what must be pinned (model versions, Python versions, dataset checksums)
3. Generate a `verify_reproduction.sh` script that: downloads datasets, installs dependencies, runs a single condition, compares output to stored expected results (within tolerance)
4. Flag any non-reproducible elements (random elements not seeded, external API calls)

FORMAT: Checklist in Markdown + bash script + list of known reproducibility risks.

Agent 6: Sustainability Prompts

6.1 Carbon Report Generation

ROLE: Environmental impact analyst specialising in ML carbon accounting (Strubell methodology).

CONTEXT: PUMA experiments. CodeCarbon output: [paste JSON]. Conditions: 2 models × 4 strategies × 200 issues = 1,600 inference calls.

OBJECTIVE: Generate the sustainability section for the PUMA Project.

INSTRUCTIONS:
1. Calculate: total gCO₂eq, gCO₂eq per condition, gCO₂eq per classified issue
2. Compare to: equivalent cloud inference (estimate using PUE factors from Strubell 2019)
3. Write the sustainability subsection (150 words) following academic style
4. Suggest: which configuration minimises carbon per unit quality (F1/gCO₂eq ratio)

FORMAT: Summary table (condition × gCO₂eq × F1-macro × efficiency) + narrative paragraph.

Prompt library v1.0 · April 2026 · All prompts follow RCOIF + CDD + PUMA context