PN: Evaluation Metrics — Comprehensive Reference for PUMA

Core Idea

This permanent note catalogs all metrics used or referenced in PUMA experiments (H1 triage classification, H2 effort estimation) and in the broader LLM/agent evaluation literature. Each metric is defined, annotated with its applicable task type, and mapped to PUMA hypotheses.


Classification Metrics (H1 — Issue Triage)

Accuracy (Overall Accuracy)

  • Use: Single-label classification; valid only when class distribution is balanced
  • PUMA: Issue type classification over TAWOS dataset; misleading if one class (e.g., “Bug”) dominates
  • Limitation: Inflated in imbalanced datasets; prefer F1-macro or weighted F1

Precision

  • Fraction of predicted positives that are truly positive
  • High precision = few false alarms; relevant for routing (do not assign an issue to the wrong person)

Recall (Sensitivity)

  • Fraction of actual positives correctly identified
  • High recall = few missed issues; critical for priority triage (don’t miss critical bugs)

F1-Score (Binary)

  • Harmonic mean of Precision and Recall; accounts for both false positives and false negatives
  • PUMA H1: Binary F1 per class (Bug, Feature, Task, Improvement, etc.)

F1-Macro (Macro-F1)

  • Unweighted average of per-class F1; gives equal weight to all classes
  • PUMA: Primary H1 metric — avoids majority-class bias; reflects performance on rare issue types
  • Interpretation (rule of thumb; depends on the dataset and the number of classes): 0.7–0.8 = good; >0.85 = excellent for multi-class NLP classification

F1-Micro (Micro-F1)

  • Aggregates TP/FP/FN across all classes; equals overall accuracy for single-label classification
  • PUMA: Useful for reporting aggregate performance; favors majority class

F1-Weighted (Weighted F1)

  • Class-support-weighted average; reflects real-world distribution
  • PUMA: Report alongside Macro-F1 to show performance under the actual class distribution
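
A minimal sketch of how the three F1 averages diverge on an imbalanced label set. It uses only the Python standard library; the function name `f1_report` and the toy labels are hypothetical illustrations, not PUMA code or TAWOS data.

```python
from collections import Counter

def f1_report(y_true, y_pred):
    """Return per-class F1 plus macro, micro, and weighted averages."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        "per_class": per_class,
        "macro": sum(per_class.values()) / len(labels),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }

# Hypothetical toy data: "Bug" dominates and the model predicts only "Bug".
y_true = ["Bug"] * 8 + ["Feature", "Task"]
y_pred = ["Bug"] * 10
report = f1_report(y_true, y_pred)
```

On this toy set Micro-F1 equals accuracy (0.8), Weighted-F1 is about 0.71, and Macro-F1 drops to about 0.30 because the two rare classes score zero.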

AUC-ROC

  • Area Under the Receiver Operating Characteristic Curve
  • ROC: plots TPR (Recall) vs. FPR at varying thresholds
  • AUC: 1.0 = perfect; 0.5 = random baseline; <0.5 = worse than random
  • PUMA: Use for binary subtasks (Critical vs. Non-Critical priority) where threshold choice matters
  • Multi-class extension: macro-averaged OvR (One-vs-Rest) AUC
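
AUC can also be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by direct pairwise counting (ties count as one half). It is quadratic in the sample size and intended only as an illustration; the function name and the example scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical Critical (1) vs. Non-Critical (0) labels with model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.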

Regression / Estimation Metrics (H2 — Effort Estimation)

MAE (Mean Absolute Error)

  • Measures average absolute deviation in original story-point units
  • Less sensitive to outliers than RMSE (though less robust than MdAE); intuitive (“off by X story points on average”)
  • PUMA H2: Primary metric for story point estimation (units: story points)

MdAE (Median Absolute Error)

  • More robust than MAE to extreme story-point outliers (e.g., 40 SP tasks skewing means)
  • PUMA: Report as a complementary robustness check alongside MAE

RMSE (Root Mean Square Error)

  • Penalizes large errors quadratically; sensitive to outliers
  • PUMA: Report to show penalty for large misestimates (e.g., estimating 1 SP for a 13 SP task)
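
The three error metrics above can be compared on one small example. This is a sketch using the standard library; the story-point values are made up to include a single large misestimate (1 SP predicted for a 13 SP task).

```python
import math
import statistics

def abs_errors(actual, predicted):
    return [abs(a - p) for a, p in zip(actual, predicted)]

def mae(actual, predicted):
    return statistics.fmean(abs_errors(actual, predicted))

def mdae(actual, predicted):
    return statistics.median(abs_errors(actual, predicted))

def rmse(actual, predicted):
    return math.sqrt(statistics.fmean((a - p) ** 2 for a, p in zip(actual, predicted)))

# Hypothetical story points; the last issue is badly underestimated.
actual = [1, 2, 3, 5, 13]
predicted = [2, 2, 3, 3, 1]
results = {
    "MAE": mae(actual, predicted),
    "MdAE": mdae(actual, predicted),
    "RMSE": rmse(actual, predicted),
}
```

For these values MAE is 3.0, MdAE is 1, and RMSE is about 5.46: the single 12-point miss barely moves the median, raises the mean, and dominates the RMSE.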

SA (Standardized Accuracy)

\text{SA} = 1 - \frac{\text{MAE}}{\text{MAE}_{P_0}}

where \text{MAE}_{P_0} is the expected MAE of a random predictor (or majority-class baseline).

  • Origin: Shepperd & MacDonell (2012) — standard in software effort estimation literature
  • PUMA H2: Normalizes MAE against a baseline; values >0 = better than random; <0 = worse
  • Interpretation: SA = 0.3 means the model’s MAE is 30% lower than the MAE of the random baseline
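
A sketch of SA under one explicit assumption about the baseline: random guessing is modeled as predicting, for each issue, the actual value of a different randomly chosen issue, so its expected MAE is the mean absolute difference over all ordered pairs. Whether PUMA uses this exact baseline is not stated here; the function names and data are hypothetical.

```python
import statistics

def mae_random_guess(actual):
    """Expected MAE when each case is predicted with another case's actual value."""
    n = len(actual)
    if n < 2:
        raise ValueError("need at least two cases")
    total = sum(
        abs(a - b)
        for i, a in enumerate(actual)
        for j, b in enumerate(actual)
        if i != j
    )
    return total / (n * (n - 1))

def standardized_accuracy(actual, predicted):
    model_mae = statistics.fmean(abs(a - p) for a, p in zip(actual, predicted))
    return 1 - model_mae / mae_random_guess(actual)

# Hypothetical story points.
actual = [1, 2, 3, 5, 13]
predicted = [2, 2, 3, 3, 1]
sa = standardized_accuracy(actual, predicted)
```

With these numbers the model MAE is 3.0 and the random-guess MAE is 5.4, giving SA of about 0.44.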

Spearman’s ρ (Rho)

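\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where d_i is the rank difference between the predicted and actual values for issue i, and n is the number of issues. This closed form is exact only when there are no tied ranks.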

  • Non-parametric rank correlation; does not assume normality (appropriate for SP distributions)
  • PUMA H2: Tests whether estimated ranking of issue effort matches actual ranking
  • Interpretation: 0.5–0.7 = moderate; >0.7 = strong correlation for ordinal SP data
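
Story-point data usually contains many ties, so the sketch below takes the tie-safe route: it assigns average ranks and then computes the Pearson correlation of the ranks. This is an illustrative standard-library implementation with hypothetical names and data, not PUMA's analysis code.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)

# Hypothetical actual vs. estimated story points (note the tie at 3).
rho = spearman_rho([1, 2, 3, 5, 8], [1, 3, 3, 5, 13])
```

The estimates here preserve the ordering except for one tie, so rho is about 0.97 even though the largest estimate is off by 5 points; rank correlation ignores the size of the error.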

Statistical Validation Metrics

Wilcoxon Signed-Rank Test

  • Non-parametric test for paired samples; tests whether median of differences = 0
  • When to use: Compare two agent configurations on the same set of issues; when SP data is non-normal
  • PUMA: Compare H1 Macro-F1 of Zero-Shot vs. Few-Shot CoT per issue type
  • Outputs: W statistic, p-value, effect size r = Z/√N
  • Threshold: p < 0.05 for significance; p < 0.01 for strong evidence
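
The outputs listed above (W, Z, and r = Z/√N) can be sketched by hand as follows. Assumptions are labeled: zero differences are dropped, tied absolute differences get average ranks, Z uses the plain normal approximation without tie or continuity correction, and N is taken as the number of non-zero pairs. For real analyses a statistics library (for example `scipy.stats.wilcoxon`) is the safer choice; the scores below are hypothetical.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Return (W, Z, r) for paired samples x and y (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for ties
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    return w, z, abs(z) / math.sqrt(n)

# Hypothetical per-class F1 scores (in percent) for two prompt strategies.
few_shot = [68, 74, 53, 59, 85, 73]
zero_shot = [62, 70, 55, 48, 81, 66]
w, z, r = wilcoxon_signed_rank(few_shot, zero_shot)
```

For this toy data W = 1, Z is about -1.99, and r is about 0.81. With only six pairs the normal approximation is rough, so an exact p-value would be preferable at this sample size.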

Shapiro-Wilk Test

  • Tests whether a sample is normally distributed (H₀: normality)
  • PUMA: Run before choosing parametric (t-test) vs. non-parametric (Wilcoxon) tests
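  • Sensitive to sample size: with large samples, even negligible departures from normality become statistically significant; for n > 50, also inspect a QQ-plot and consider Kolmogorov-Smirnov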

Cohen’s d (Effect Size)

  • Standardized mean difference; interpretable: 0.2 = small, 0.5 = medium, 0.8 = large
  • PUMA: Quantify practical significance beyond p-value; required for research papers
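
A short sketch of Cohen's d for two independent samples using the pooled standard deviation. This is one common variant and an assumption here; for paired designs, the mean of the differences divided by their standard deviation is often used instead. The sample values are hypothetical.

```python
import math
import statistics

def cohens_d(a, b):
    """Standardized mean difference with a pooled (sample) standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd

# Hypothetical absolute errors from two configurations.
d = cohens_d([2, 4, 6, 8], [1, 3, 5, 7])
```

The means differ by 1 and the pooled standard deviation is about 2.58, so d is about 0.39, between the small and medium thresholds listed above.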

r (Effect Size for Wilcoxon)

  • Effect size for non-parametric tests; 0.1 = small, 0.3 = medium, 0.5 = large
  • PUMA: Report alongside Wilcoxon p-value

Confidence Interval (CI)

  • Bootstrap CI: Preferred when distribution is unknown; resample 1000× to build empirical distribution
  • PUMA: Report 95% CI for all primary metrics (Macro-F1, MAE, SA); conveys precision of estimates
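
A minimal percentile-bootstrap sketch, assuming the metric can be recomputed from a resampled list of per-issue values. The function name, the default of 1000 resamples, and the fixed seed are illustrative choices; other bootstrap variants (for example BCa) exist and may be preferable for skewed statistics.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.fmean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat(values); seeded for reproducibility."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        stat(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower_idx = max(0, int(alpha / 2 * n_resamples))
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lower_idx], estimates[upper_idx]

# Hypothetical per-issue absolute errors; the CI is for their mean (the MAE).
errors = [1, 0, 0, 2, 12, 3, 1, 0, 5, 2]
ci_low, ci_high = bootstrap_ci(errors)
```

For Macro-F1 the same idea applies if `values` holds (true, predicted) pairs and `stat` recomputes Macro-F1 on each resample.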

LLM / Text Generation Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • ROUGE-N: n-gram overlap between hypothesis and reference
  • ROUGE-L: Longest Common Subsequence (LCS) based F-score
  • Use: Summarization, explanation quality; not primary for PUMA classification tasks
  • PUMA: Optional metric for evaluating quality of agent-generated issue summaries (Stage 4)
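
A sketch of ROUGE-L on whitespace tokens. It assumes lowercase whitespace tokenization and a balanced F1 (β = 1); published ROUGE implementations differ in tokenization, stemming, and the β weighting, so the numbers are only comparable within one implementation. The sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, hypothesis):
    """Return (precision, recall, f1) based on the LCS of the two texts."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f = rouge_l(
    "the agent assigns the bug to backend",
    "agent assigns bug to the backend team",
)
```

The longest common subsequence here is "agent assigns bug to backend" (5 tokens out of 7 on each side), so precision, recall, and F1 are all 5/7.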

BLEU (Bilingual Evaluation Understudy)

  • Precision-based n-gram overlap with brevity penalty
  • Limitation: Poor correlation with human judgment for short or creative text
  • PUMA: Not primary; use only if comparing agent-generated triage rationale to human explanations

Perplexity

  • Measures how well a language model predicts a sequence; lower = less “surprised” (better prediction)
  • PUMA: Not a primary metric; relevant if evaluating fine-tuned model quality vs. base model
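
Perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from the model; the function name and inputs are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options.
uniform_4 = perplexity([math.log(0.25)] * 4)
confident = perplexity([math.log(0.9)] * 4)
```

The first value is 4 and the second is about 1.11, which illustrates that lower perplexity corresponds to a better-predicting model.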

Recall@k

  • Fraction of relevant items appearing in top-k retrieval results
  • PUMA: For RAG-based few-shot retrieval (H2 Stage 3); higher Recall@k = better reference examples surfaced

MRR (Mean Reciprocal Rank)

  • Evaluates how high the first relevant result appears in a ranked list
  • PUMA: Evaluate quality of historical issue retrieval in few-shot construction pipeline
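
Both retrieval metrics can be sketched in a few lines. The issue IDs and relevance judgments below are hypothetical; in practice the definition of a "relevant" historical issue has to be fixed before these numbers mean anything.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved items."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Hypothetical retrieval results for two query issues.
ranked = [["ISSUE-7", "ISSUE-2", "ISSUE-9", "ISSUE-4"], ["ISSUE-1", "ISSUE-3"]]
relevant = [{"ISSUE-2", "ISSUE-4", "ISSUE-5"}, {"ISSUE-1"}]
r_at_3 = recall_at_k(ranked[0], relevant[0], k=3)
mrr = mean_reciprocal_rank(ranked, relevant)
```

For the first query only one of three relevant issues is in the top 3 (Recall@3 = 1/3). The first relevant hits sit at ranks 2 and 1, so MRR = (0.5 + 1.0) / 2 = 0.75.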

Agent / LLM Operational Metrics

Successful Parsing Rate (SPR)

\text{SPR} = \frac{\text{\# valid JSON responses}}{\text{\# total LLM calls}} \times 100\%

  • Fraction of LLM outputs that parse successfully as valid structured output (JSON/YAML/Markdown table)
  • Origin: Derived from AgentBench (Liu et al., ICLR 2024) failure analysis; named by PUMA
  • PUMA: Track per model, per prompt strategy; SPR < 90% = reliability risk
  • Distinguishes format failure (valid JSON but wrong fields) from parse failure (invalid syntax)
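
A sketch of how SPR and the parse/format distinction could be computed for JSON outputs. The required field names and the sample responses are hypothetical, and only JSON is handled; YAML or Markdown-table outputs would need their own parsers.

```python
import json

def classify_response(raw, required_fields):
    """Label one raw LLM output as 'ok', 'format_failure', or 'parse_failure'."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return "parse_failure"  # invalid JSON syntax
    if not isinstance(payload, dict) or not set(required_fields).issubset(payload):
        return "format_failure"  # valid JSON, but wrong shape or missing fields
    return "ok"

def successful_parsing_rate(raw_responses, required_fields):
    """Return (SPR in percent, per-response labels)."""
    labels = [classify_response(r, required_fields) for r in raw_responses]
    parsed = sum(label != "parse_failure" for label in labels)
    return 100.0 * parsed / len(labels), labels

# Hypothetical raw outputs from four LLM calls.
responses = [
    '{"type": "Bug", "priority": "High"}',
    '{"type": "Bug"}',
    'Sure! Here is the JSON: {type: Bug}',
    '["Bug"]',
]
spr, labels = successful_parsing_rate(responses, {"type", "priority"})
```

Three of the four responses are syntactically valid JSON, so SPR is 75%; two of those three are still format failures, which is why the two failure types are worth tracking separately.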

Inference Latency

  • End-to-end wall-clock time per LLM call
  • PUMA: Report p50, p95, p99 latency across 500 issue experiments; compare cloud vs. local models
  • Critical for real-time triage SLA analysis
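
Tail percentiles can be computed with the nearest-rank method, sketched below. This is one of several percentile definitions (others interpolate between samples), so the method should be stated when reporting p95/p99. The latency values are synthetic.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples <= it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies of 1..100 ms make the result easy to check by eye.
latencies_ms = list(range(1, 101))
summary = {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}
```

With 100 evenly spaced samples the result is p50 = 50, p95 = 95, and p99 = 99 ms.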

Tokens per Second (t/s)

  • Throughput metric for local model inference (Ollama, llama.cpp, vLLM)
  • PUMA: Benchmark Llama 3.2 8B, Mistral 7B, Phi-3.5 Mini on PUMA hardware

AIOps-Specific Metrics

MTTD (Mean Time to Detect)

  • Time from incident start to detection alert
  • Origin: ITSM/DevOps; standard metric in SRE and incident management
  • PUMA: Analogous to time from issue submission to triage completion

MTTR (Mean Time to Resolve)

  • Time from detection to full resolution
  • PUMA: Baseline metric; compare manual MTTR vs. PUMA-assisted MTTR in future field studies

Sustainability Metrics

Carbon Footprint (CO₂ equivalent)

\text{CO}_2\text{eq} = E \times CI

where E = energy consumed (kWh) and CI = carbon intensity of the electricity grid (kg CO₂/kWh)

  • Tools: CodeCarbon (Python library, automatic tracking), Experiment Impact Tracker
  • Reference: Strubell et al. (2019) — NLP training emissions analysis; Lottick et al. (2019)
  • PUMA: Report CO₂eq for all local model experiments; include in ethics/sustainability chapter
  • PUE (Power Usage Effectiveness) = total facility power / IT equipment power; ideal PUE ≈ 1.0
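
The arithmetic can be sketched as below. Multiplying by PUE to account for facility overhead is an assumption added for illustration (it defaults to 1.0, which reduces to energy × carbon intensity); the energy and grid-intensity figures are made up, not measurements.

```python
def co2eq_kg(energy_kwh, carbon_intensity_kg_per_kwh, pue=1.0):
    """Estimated emissions in kg CO2eq; pue=1.0 means no facility overhead."""
    if energy_kwh < 0 or carbon_intensity_kg_per_kwh < 0 or pue < 1.0:
        raise ValueError("energy and intensity must be >= 0 and PUE >= 1.0")
    return energy_kwh * pue * carbon_intensity_kg_per_kwh

# Hypothetical experiment: 2.5 kWh on a grid emitting 0.4 kg CO2eq per kWh.
local_run = co2eq_kg(2.5, 0.4)
with_overhead = co2eq_kg(2.5, 0.4, pue=1.5)
```

The first call gives 1.0 kg CO₂eq; with a PUE of 1.5 the same run is attributed 1.5 kg.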

Metric Selection Guide for PUMA

| Task | Primary Metric | Secondary Metrics | Validation |
|---|---|---|---|
| H1 Issue Type Classification | Macro-F1 | Weighted-F1, Accuracy | Wilcoxon (model pairs) |
| H1 Priority Classification | Macro-F1 | AUC-ROC (Critical vs Rest) | Cohen’s d |
| H1 Component Routing | Macro-F1 | Recall (per component) | 95% Bootstrap CI |
| H2 Story Point Estimation | MAE, SA | MdAE, RMSE, Spearman ρ | Wilcoxon + Cohen’s d |
| Few-shot Retrieval Quality | Recall@k | MRR | |
| Agent Reliability | SPR | Latency p95 | |
| Sustainability | CO₂eq (kg) | t/s, energy (kWh) | |

MOCs