Prompt: Claude Code — Triage Agent Implementation

Purpose: Implement the TriageAgent class from spec. Use after completing SP-Triage-Agent.


📋 Prompt 1 — Project Setup

# Run in terminal with Claude Code active in project root
claude "
Set up the PUMA benchmark project structure. 
 
Current directory: /path/to/puma/
Create the following structure:
puma/
  src/
    agents/
      __init__.py
      triage_agent.py      ← main implementation
      estimation_agent.py  ← stub only
    data/
      __init__.py
      loader.py            ← dataset loading utilities
    evaluation/
      __init__.py
      metrics.py           ← F1, MAE, Wilcoxon functions
      carbon.py            ← CodeCarbon integration wrapper
    prompts/
      __init__.py
      templates.py         ← all prompt templates as constants
    utils/
      __init__.py
      ollama_client.py     ← Ollama API wrapper
      logger.py            ← structured JSON logger
  tests/
    test_triage_agent.py
    test_metrics.py
    test_ollama_client.py
  notebooks/
    01_triage_exploration.ipynb
  docs/
    specs/
      SP-Architecture-v1.md   ← copy from Obsidian
  requirements.txt
  README.md
  .gitignore
 
Use Python 3.11. Add type hints everywhere. 
Create proper __init__.py files with public API exports.
"

📋 Prompt 2 — Implement TriageAgent

claude "
Implement the TriageAgent class in src/agents/triage_agent.py.
 
Spec requirements (from SP-Triage-Agent-v1.md):
- Accept: model (str), strategy ('zero-shot'|'few-shot-3'|'few-shot-6'|'cot')
- Accept: issue dict with keys 'title', 'description' (optional)
- Return: dict with keys:
    predicted_priority (str: 'Critical'|'High'|'Medium'|'Low')
    raw_response (str)
    latency_s (float)
    emissions_gco2 (float)
    confidence (str: 'high'|'medium'|'low') — parsed from response structure
    model (str)
    strategy (str)
 
Implementation requirements:
1. Use OllamaClient from src/utils/ollama_client.py
2. Use prompt templates from src/prompts/templates.py
3. Integrate CodeCarbon EmissionsTracker per-call
4. seed=42, temperature=0 in all Ollama calls
5. Robust label parsing: if model outputs something other than the 4 classes,
   log the raw response and return 'Unknown'
6. Include comprehensive docstrings and type hints
7. No global state — the agent should be fully instantiable
 
After implementation:
- Write 5 unit tests in tests/test_triage_agent.py
- Tests must cover: valid output, edge cases (empty description, long title),
  invalid model name, label parsing of ambiguous output
"

📋 Prompt 3 — Benchmark Runner

claude "
Implement a benchmark runner script at src/run_benchmark.py.
 
The script should:
1. Load the Jira SR subset (200 issues, stratified CSV)
2. For each combination of:
   - models: ['llama3.2:8b', 'mistral:7b']
   - strategies: ['zero-shot', 'few-shot-3', 'few-shot-6', 'cot']
3. Run TriageAgent on all 200 issues
4. Calculate metrics (F1-macro, precision, recall per class)
5. Run Wilcoxon test vs heuristic baseline
6. Log all results to results/triage_results_{timestamp}.json
7. Generate a markdown summary table to results/triage_summary_{timestamp}.md
 
Use tqdm for progress bars.
Add --dry-run flag that runs on only 5 issues for testing.
Add --model flag to run a single model only.
Reproducibility: all random operations use seed=42.
 
Output format for the JSON:
{
  'metadata': {run_date, seed, models, strategies, dataset},
  'results': [{model, strategy, issue_id, predicted, actual, latency_s, emissions_gco2}],
  'metrics': {model: {strategy: {f1_macro, precision, recall, wilcoxon_p, effect_r}}}
}
"


id: PT-Copilot-Scaffolding title: “Prompt: GitHub Copilot / Cursor AI — Code Scaffolding” type: prompt-template tags: [prompt, copilot, cursor, scaffolding, coding] tool: github-copilot methodology: cdd use_case: coding phase: F2 version: 1.0 tested: true effectiveness: medium created: 2026-03-01

Prompt: GitHub Copilot / Cursor AI — Code Scaffolding

Best for: Generating boilerplate code quickly from a clear specification. Less effective than Claude Code for complex multi-file tasks.

Pattern 1 — Function from Spec Comment

In the code file, write a detailed docstring/comment, then let Copilot complete:

def calculate_wilcoxon_vs_baseline(
    model_scores: list[float],
    baseline_scores: list[float], 
    alpha: float = 0.05
) -> dict:
    """
    Perform Wilcoxon signed-rank test comparing model F1 scores 
    against baseline F1 scores across the evaluation subset.
    
    Args:
        model_scores: List of per-instance F1 scores for LLM model
        baseline_scores: List of per-instance F1 scores for heuristic baseline
        alpha: Significance threshold (default 0.05 per PUMA protocol)
    
    Returns:
        dict with keys:
            statistic (float): Wilcoxon W statistic
            p_value (float): Two-sided p-value
            effect_size_r (float): Effect size r = Z / sqrt(N)
            is_significant (bool): p_value < alpha
            interpretation (str): Human-readable conclusion
    
    Example:
        >>> model = [0.7, 0.8, 0.65, 0.75]
        >>> baseline = [0.5, 0.55, 0.48, 0.52]
        >>> result = calculate_wilcoxon_vs_baseline(model, baseline)
        >>> result['is_significant']
        True
    """
    # [Copilot completes from here]

Pattern 2 — Cursor Chat for Context-Aware Refactoring

In Cursor chat (Cmd+L):

"In triage_agent.py, the parse_priority_label function 
currently uses a simple strip() and exact match. 
Improve it to handle:
1. Case-insensitive matching ('critical', 'CRITICAL', 'Critical')
2. Fuzzy matching for close typos ('Critcal' → 'Critical')
3. Matching when the label appears in a longer sentence 
   ('The priority is High based on...' → 'High')
4. Return 'Unknown' with logging if no match found

Do not change the function signature or return type."

Pattern 3 — Warp AI Terminal

# In Warp terminal, type naturally:
"run the benchmark for only mistral:7b with strategy zero-shot 
and save results to debug folder"
 
# Warp translates to:
python src/run_benchmark.py --model mistral:7b --strategy zero-shot --output-dir results/debug/