LN: Shinn et al. (2023) — Reflexion: Language Agents with Verbal Reinforcement Learning

Bibliographic Reference

Citation: Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv:2303.11366. https://arxiv.org/abs/2303.11366

Venue: NeurIPS 2023.


Pass 1 — Bird’s Eye View (5 Cs)

| C | Assessment |
|---|---|
| Category | System proposal + empirical evaluation |
| Context | Extends ReAct (Yao et al., 2022) with a self-critique loop; addresses the “repetitive failure” weakness of ReAct |
| Correctness | Evaluated on HotpotQA, HumanEval (coding), and AlfWorld (sequential decision-making). Results are reproducible. |
| Contributions | (1) Verbal reinforcement: agents self-reflect on failures in natural language; (2) Episodic memory: reflections persist across trials without weight updates; (3) 91% on HumanEval vs. 80% GPT-4 zero-shot |
| Clarity | Excellent. Architecture diagram is clear. Prompts are fully disclosed. |

Relevance: ⭐⭐⭐⭐⭐

Direct applicability to PUMA Stage 4–5: agents can self-critique classification decisions and improve iteratively without retraining.


Pass 2 — Content

Core Idea

Reflexion adds a self-reflection loop on top of the ReAct cycle:

[Trial t] → Thought → Action → Observation → (failure/success)
    ↓ if failure
[Reflector LLM] → Verbal critique → stored in Episodic Memory
    ↓
[Trial t+1] → Thought + Memory Context → improved Action

The key insight: natural language is a richer gradient signal than numeric reward. Instead of updating weights (which requires many samples), the agent writes a paragraph explaining why it failed and what to try differently.
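
The "memory context" step can be illustrated as plain prompt assembly. This is a minimal sketch, assuming reflections are stored as strings; the function name `build_prompt` and the prompt layout are hypothetical and are not the prompts used in the paper.

```python
def build_prompt(task: str, reflections: list[str]) -> str:
    """Prepend earlier verbal critiques to the task prompt (hypothetical layout)."""
    if not reflections:
        return f"Task: {task}"
    # Number the critiques so the model sees the order in which they were written.
    memory = "\n".join(f"{i}. {text}" for i, text in enumerate(reflections, start=1))
    return f"Reflections from earlier failed trials:\n{memory}\n\nTask: {task}"


# Trial t has no memory; trial t+1 sees the critique written after the failure.
first_prompt = build_prompt("Classify issue 'App crash on login'", [])
retry_prompt = build_prompt(
    "Classify issue 'App crash on login'",
    ["I over-weighted the word 'crash' and skipped the description."],
)
```

No model weights change between the two calls; the only thing that differs is the text the model is conditioned on.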

Three Agent Components

| Component | Role |
|---|---|
| Actor | Generates actions using ReAct or CoT |
| Evaluator | Scores the trajectory (binary or heuristic) |
| Self-Reflector | Generates verbal critique from the failed trajectory |
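
One way to picture how the three components fit together is as three small interfaces wired into a single trial. The sketch below is illustrative only: the `Trajectory` type, the method names, and the toy implementations are assumptions made for this note, not the paper's code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Trajectory:
    """Record of one trial: the steps taken and the final answer."""
    steps: list[str]
    answer: str


class Actor(Protocol):
    def act(self, task: str, reflections: list[str]) -> Trajectory: ...


class Evaluator(Protocol):
    def score(self, task: str, trajectory: Trajectory) -> bool: ...


class SelfReflector(Protocol):
    def reflect(self, task: str, trajectory: Trajectory) -> str: ...


def run_trial(actor: Actor, evaluator: Evaluator, reflector: SelfReflector,
              task: str, reflections: list[str]) -> tuple[bool, Trajectory]:
    """Run one trial; on failure, append a verbal critique to the memory list."""
    trajectory = actor.act(task, reflections)
    success = evaluator.score(task, trajectory)
    if not success:
        reflections.append(reflector.reflect(task, trajectory))
    return success, trajectory


# Toy stand-ins (no LLM calls) so the wiring can be exercised end to end.
class ToyActor:
    def act(self, task, reflections):
        # Pretend the critique changes the answer on the second attempt.
        answer = "Enhancement" if reflections else "Bug"
        return Trajectory(steps=[f"read: {task}"], answer=answer)


class ExactMatchEvaluator:
    def __init__(self, expected):
        self.expected = expected

    def score(self, task, trajectory):
        return trajectory.answer == self.expected


class TemplateReflector:
    def reflect(self, task, trajectory):
        return f"Answer '{trajectory.answer}' was wrong; reread the description."


memory: list[str] = []
parts = (ToyActor(), ExactMatchEvaluator("Enhancement"), TemplateReflector())
first_ok, _ = run_trial(*parts, "Add dark mode", memory)
second_ok, second = run_trial(*parts, "Add dark mode", memory)
```

Keeping the evaluator separate from the reflector mirrors the table: one component decides whether the trial failed, the other explains why.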

Key Results

| Benchmark | Reflexion | Baseline |
|---|---|---|
| HumanEval (coding) | 91.0% | 80.1% (GPT-4) |
| HotpotQA | 70% | 68% (ReAct) |
| AlfWorld | 97% | 75% (ReAct) |
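
To compare the rows on a common footing, the gains can be expressed both in absolute points and as the share of baseline errors removed. The snippet below only does arithmetic on the figures in the table above; the helper name `gains` is introduced here for illustration.

```python
# Scores in percent, copied from the Key Results table: (Reflexion, baseline).
results = {
    "HumanEval": (91.0, 80.1),
    "HotpotQA": (70.0, 68.0),
    "AlfWorld": (97.0, 75.0),
}


def gains(reflexion: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in points, fraction of baseline errors removed)."""
    absolute = reflexion - baseline
    error_reduction = absolute / (100.0 - baseline)
    return round(absolute, 1), error_reduction


summary = {name: gains(*scores) for name, scores in results.items()}

for name, (points, reduction) in summary.items():
    print(f"{name}: +{points} points, {reduction:.0%} of baseline errors removed")
```

The baselines differ by row (GPT-4 for HumanEval, ReAct for the other two), so these figures describe each benchmark separately and should not be ranked against each other without that caveat.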

Failure Modes

  • Memory is ephemeral (not persistent across separate sessions unless explicitly stored)
  • Reflexion can over-commit to a wrong reflection (for example, resolving to abandon approach X when X was actually correct)
  • Hallucinated reflections: the agent may generate plausible-sounding but incorrect self-critiques
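
The first failure mode (ephemeral memory) has a straightforward mitigation: write the reflections to disk between sessions. The following is a minimal sketch, assuming reflections are plain strings and a JSON file is an acceptable store; the function names and file name are hypothetical. It does nothing about the other two failure modes, since a wrong or hallucinated reflection is persisted just as faithfully as a good one.

```python
import json
import tempfile
from pathlib import Path


def save_reflections(path: Path, reflections: list[str]) -> None:
    """Write the reflection list to disk so a later session can reload it."""
    path.write_text(json.dumps(reflections, ensure_ascii=False, indent=2), encoding="utf-8")


def load_reflections(path: Path) -> list[str]:
    """Return stored reflections, or an empty list if nothing was saved yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


# Round trip in a temporary directory to show the intended usage.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "reflections.json"
    before = load_reflections(store)  # nothing saved yet
    save_reflections(store, ["Read the description before labeling."])
    after = load_reflections(store)   # what a new session would see
```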

Pass 3 — Virtual Reconstruction

Q1 (Can I use this for PUMA?): Yes. A PUMA triage agent that misclassifies a Jira issue can generate a verbal critique: “I classified this as Bug when it is Enhancement because I over-weighted the word ‘crash’ in the title without reading the description.” This reflection is stored in a rolling buffer and included in the next classification prompt.
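
The rolling buffer mentioned in Q1 could look like the sketch below. The capacity of 2, the function name, and the templated critique are assumptions for illustration; in a real loop the critique text would come from the reflector LLM rather than a format string.

```python
from collections import deque

# Assumed capacity; this note does not specify a buffer size.
MAX_REFLECTIONS = 2

reflection_buffer = deque(maxlen=MAX_REFLECTIONS)


def record_misclassification(predicted: str, actual: str, reason: str) -> str:
    """Format a critique in the style of the Q1 example and store it."""
    reflection = f"I classified this as {predicted} when it is {actual} because {reason}."
    # With maxlen set, appending to a full deque silently drops the oldest entry.
    reflection_buffer.append(reflection)
    return reflection


record_misclassification("Bug", "Enhancement", "I over-weighted the word 'crash' in the title")
record_misclassification("Task", "Bug", "I ignored the stack trace in the description")
record_misclassification("Bug", "Task", "I treated a refactoring request as a defect")

# Only the two most recent critiques remain available for the next prompt.
context_for_next_prompt = "\n".join(reflection_buffer)
```

A bounded buffer keeps the prompt length predictable, at the cost of forgetting older lessons.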

Q2 (What does 91% on HumanEval mean for PM tasks?): HumanEval is a coding benchmark — the transfer is indirect. For PUMA triage (classification), the relevant gain is the AlfWorld sequential decision result (97%), which is structurally more similar.

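Q3 (Reproducibility): The paper provides full prompts. Combined with temperature=0 and seed=42 (PUMA constitution), the Reflexion loop should be largely reproducible, although hosted LLM APIs do not guarantee bit-identical outputs even at temperature 0, so determinism should be verified rather than assumed.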
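
Repeatability can be checked empirically by sending the same prompt several times and comparing the outputs. This is a small sketch under the assumption that the model call can be wrapped as a function from prompt to text; `is_repeatable` and the two stand-in generators are hypothetical and make no real API calls.

```python
import itertools
from typing import Callable


def is_repeatable(generate: Callable[[str], str], prompt: str, runs: int = 3) -> bool:
    """Return True if `generate` yields the same text on every run for `prompt`."""
    outputs = {generate(prompt) for _ in range(runs)}
    return len(outputs) == 1


# Stand-in for a deterministic model call.
def stable(prompt: str) -> str:
    return prompt.upper()


# Stand-in for a call whose output drifts between runs.
_counter = itertools.count()


def drifting(prompt: str) -> str:
    return f"{prompt} #{next(_counter)}"


stable_ok = is_repeatable(stable, "classify issue 42")
drifting_ok = is_repeatable(drifting, "classify issue 42")
```

Running such a check on each stage of the loop (actor and reflector separately) shows where any nondeterminism enters.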

Implementation sketch for PUMA:

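def reflexion_triage(triage_agent, reflector_llm, issue, ground_truth, max_trials):
    """Retry triage on one labeled issue, accumulating verbal reflections."""
    reflections = []  # episodic memory: one critique per failed trial
    result = None
    for _ in range(max_trials):
        # Earlier critiques are passed back in as extra prompt context.
        result = triage_agent.run(issue, reflections=reflections)
        if result.is_correct():
            break
        # Needs a known label, so this only applies to training/validation data.
        reflection = reflector_llm.generate(
            trajectory=result.trajectory,
            correct_label=ground_truth,
        )
        reflections.append(reflection)
    return result, reflections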

PUMA Integration

  • Stage 4 (RAG Triage): Reflexion loop for iterative self-improvement on misclassified issues in the training/validation set
  • Stage 5 (SmartPMO): Manager agent can reflect on coordination failures and adjust sub-agent routing
  • Spec: SP-Triage-Agent-v1 — add Reflexion loop to triage agent spec
  • Experiment: EX-Stages-Overview — consider Reflexion as Stage 4 ablation

MOCs