⚔️ Grok — Adversarial Critique and Stress Testing for PUMA

Tool: Grok (xAI) Phase: Phase 1 — Research Step: 07 — Critical Review Methodology: Adversarial review + Stress testing


Prompt A — Skeptical Benchmark Review

Role: Skeptical academic reviewer trying to reject an overconfident paper.

Context: PUMA claims that local LLM agents (Llama 3.2 8B) can significantly outperform majority-class baselines for Jira issue triage and story point estimation.

Find what could go wrong with this claim:
1. Challenge the problem statement: is automated issue triage actually a valuable problem, or is it already solved by simpler tools?
2. Challenge the datasets: what are the known limitations of Jira SR and TAWOS for evaluating LLM triage?
3. Challenge the evaluation metrics: is F1-macro the right metric for imbalanced multi-class triage? What about weighted-F1 or per-class F1?
4. Challenge the baseline: is a majority-class baseline too weak to be meaningful? What should a fair baseline be?
5. Challenge the generalization claim: how confident are we that Llama 3.2 8B results transfer to real PM settings?

Format: Critical memo with major objections, hidden assumptions, rival hypotheses, and specific design fixes for each objection.

Prompt B — Failure Mode Extraction

Role: Failure analyst designing a stress test for AI agents in project management.

Context: PUMA's triage agent classifies issue priority. I want PUMA to detect realistic failure modes, not just report average F1-macro.

List the likely failure modes of LLM agents in issue triage tasks:
1. Hallucinations (priority not supported by issue text)
2. Context drift (agent loses track of priority criteria in few-shot examples)
3. Overconfidence (high predicted probability for wrong class)
4. Brittle routing (performance collapses for certain issue types: bugs vs. features vs. improvements)
5. Sensitivity to prompt wording (small prompt changes cause large F1 swings)
6. Poor calibration (confident when wrong, uncertain when right)

For each failure mode: (a) what it looks like in practice, (b) how it can be measured in PUMA's experiment, (c) severity level for PM use, (d) how PUMA should expose it in the results.

Format: Failure mode table with measurement strategy.

Prompt C — Adversarial Prompt Inspection

Role: Prompt adversarial auditor.

Context: PUMA uses a zero-shot classification prompt for Llama 3.2 8B. The prompt is:
[paste the actual PUMA triage prompt here]

Analyze the prompt for: (1) ambiguous phrases that might cause inconsistent model behavior, (2) parts that are too broad (leaving too much model freedom), (3) parts that are too narrow (artificially constraining the model), (4) potential for the model to "guess" the correct answer from prompt structure rather than issue content.

Produce: (a) a weaker version (shows the flaw most clearly), (b) a stronger version (fixes the flaw), (c) a safer version (maximizes constraint and auditability).

Format: Audit table: Issue | Risk | Stronger Version | Safer Version.

Prompt D — Contrarian Literature Synthesis

Role: Contrarian synthesizer for the PUMA research field.

Context: The AI+project management literature generally paints an optimistic picture of LLM agents for PM tasks.

Write the strongest skeptical interpretation of the field (2023–2026):
1. Why might current evidence be overstated? (dataset bias, cherry-picked metrics, weak baselines)
2. Where might performance claims NOT transfer to real PM environments?
3. What experiments would most likely falsify the optimistic claims?
4. What must PUMA test specifically to avoid repeating these mistakes?

Format: Critical synthesis with 4 sections and a final checklist of what PUMA MUST do to be credible.

PUMA Relevance

Grok’s adversarial style is used to stress-test PUMA’s claims before TFG submission. The failure mode analysis generates the “Limitations” section content for PR-PUMA-Ch4-Results. The contrarian synthesis informs the honest framing of results.


MOCs