LN: Shinn et al. (2023) — Reflexion: Language Agents with Verbal Reinforcement Learning

Bibliographic Reference

Citation: Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv:2303.11366. https://arxiv.org/abs/2303.11366

Venue: NeurIPS 2023.

Pass 1 — Bird’s Eye View (5 Cs)

C	Assessment
Category	System proposal + empirical evaluation
Context	Extends ReAct (Yao et al., 2022) with a self-critique loop; addresses the “repetitive failure” weakness of ReAct
Correctness	Evaluated on HotpotQA (multi-hop question answering), HumanEval (coding), and AlfWorld (sequential decision-making). The authors release code and prompts, so the results should be reproducible in principle.
Contributions	(1) Verbal reinforcement: agents self-reflect on failures in natural language; (2) Episodic memory: reflections persist across trials without weight updates; (3) 91% pass@1 on HumanEval vs. 80% for GPT-4 zero-shot
Clarity	Excellent. Architecture diagram is clear. Prompts are fully disclosed.

Relevance: ⭐⭐⭐⭐⭐

Direct applicability to PUMA Stage 4–5: agents can self-critique classification decisions and improve iteratively without retraining.

Pass 2 — Content

Core Idea

Reflexion adds a self-reflection loop on top of the ReAct cycle:

[Trial t] → Thought → Action → Observation → (failure/success)
    ↓ if failure
[Reflector LLM] → Verbal critique → stored in Episodic Memory
    ↓
[Trial t+1] → Thought + Memory Context → improved Action

The key insight: natural-language feedback acts as a richer "gradient" signal than a numeric reward. Instead of updating weights, which requires many samples, the agent writes a short reflection explaining why it failed and what to try differently, and that text conditions the next trial.

Three Agent Components

Component	Role
Actor	Generates actions using ReAct or CoT
Evaluator	Scores the trajectory (binary or heuristic)
Self-Reflector	Generates verbal critique from the failed trajectory
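
The division of labour in the table can be sketched in Python as three plain callables wired into a trial loop. This is a minimal sketch under stated assumptions, not the paper's code: the names `Actor`, `Evaluator`, `SelfReflector`, `run_reflexion`, and the toy stubs are all hypothetical, and a real Actor and Self-Reflector would wrap LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for the three roles (illustrative only).
Actor = Callable[[str, List[str]], str]   # (task, reflections) -> trajectory
Evaluator = Callable[[str], bool]         # trajectory -> success?
SelfReflector = Callable[[str], str]      # failed trajectory -> verbal critique


def run_reflexion(task: str, actor: Actor, evaluator: Evaluator,
                  reflector: SelfReflector, max_trials: int = 3) -> Tuple[str, List[str]]:
    """Retry a task, feeding verbal critiques of failed trials back to the actor."""
    reflections: List[str] = []  # episodic memory; no weights are updated
    trajectory = ""
    for _ in range(max_trials):
        trajectory = actor(task, reflections)
        if evaluator(trajectory):
            break
        reflections.append(reflector(trajectory))
    return trajectory, reflections


# Toy stubs standing in for LLM calls: the actor only succeeds once it has a critique.
def toy_actor(task: str, reflections: List[str]) -> str:
    return "read description, answer: Enhancement" if reflections else "answer: Bug"


def toy_evaluator(trajectory: str) -> bool:
    return trajectory.endswith("Enhancement")


def toy_reflector(trajectory: str) -> str:
    return f"Previous attempt '{trajectory}' failed; read the description before answering."


final, memory = run_reflexion("classify issue", toy_actor, toy_evaluator, toy_reflector)
```

With these stubs the first trial fails, one reflection is stored, and the second trial succeeds. Keeping the roles as separate callables makes it easy to swap a binary Evaluator for a heuristic one without touching the loop.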

Key Results

Benchmark	Reflexion	Baseline
HumanEval (coding)	91.0%	80.1% (GPT-4)
HotpotQA	70%	68% (ReAct)
AlfWorld	97%	75% (ReAct)

Failure Modes

Memory is ephemeral (not persistent across separate sessions unless explicitly stored)
Reflexion can over-commit to a wrong reflection (e.g. “I should avoid X” even when X was actually the correct action)
Hallucinated reflections: the agent may generate plausible-sounding but incorrect self-critiques
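
The first failure mode, memory that does not survive a session, can be worked around by writing reflections to disk. The following is a minimal sketch using only the Python standard library; the function names, the JSON file layout, and the `keep_last` cap are assumptions for illustration and are not part of the paper.

```python
import json
from pathlib import Path
from typing import List


def load_reflections(path: Path) -> List[str]:
    """Return reflections saved by an earlier session, or an empty list."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_reflections(path: Path, reflections: List[str], keep_last: int = 3) -> None:
    """Persist only the most recent reflections so the prompt stays bounded."""
    # Guard against keep_last <= 0, since reflections[-0:] would keep everything.
    recent = reflections[-keep_last:] if keep_last > 0 else []
    path.write_text(json.dumps(recent, ensure_ascii=False), encoding="utf-8")
```

Capping the stored list also limits how long a wrong or hallucinated reflection can keep influencing later trials, although it does not detect such reflections.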

Pass 3 — Virtual Reconstruction

Q1 (Can I use this for PUMA?): Yes. A PUMA triage agent that misclassifies a Jira issue can generate a verbal critique: “I classified this as Bug when it is Enhancement because I over-weighted the word ‘crash’ in the title without reading the description.” This reflection is stored in a rolling buffer and included in the next classification prompt.
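
The rolling buffer and prompt injection described in Q1 could look roughly like this. It is a sketch under assumptions: `MAX_REFLECTIONS`, `build_triage_prompt`, the prompt wording, and the example issue are hypothetical and do not come from the PUMA spec or the paper.

```python
from collections import deque
from typing import Deque

# Hypothetical buffer size; the oldest reflection is dropped automatically when full.
MAX_REFLECTIONS = 3
reflection_buffer: Deque[str] = deque(maxlen=MAX_REFLECTIONS)


def build_triage_prompt(issue_title: str, issue_description: str,
                        reflections: Deque[str]) -> str:
    """Assemble a classification prompt that prepends stored reflections."""
    lines = []
    if reflections:
        lines.append("Lessons from earlier mistakes:")
        lines.extend(f"- {r}" for r in reflections)
        lines.append("")
    lines.append(f"Title: {issue_title}")
    lines.append(f"Description: {issue_description}")
    lines.append("Classify this issue as Bug or Enhancement.")
    return "\n".join(lines)


reflection_buffer.append(
    "I over-weighted the word 'crash' in the title without reading the description."
)
prompt = build_triage_prompt(
    "App crash report export",
    "Add CSV export to crash reports.",
    reflection_buffer,
)
```

Using `deque(maxlen=...)` keeps the prompt length bounded without any manual trimming logic.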

Q2 (What does 91% on HumanEval mean for PM tasks?): HumanEval is a coding benchmark — the transfer is indirect. For PUMA triage (classification), the relevant gain is the AlfWorld sequential decision result (97%), which is structurally more similar.

Q3 (Reproducibility): The paper provides full prompts. Combined with temperature=0 and seed=42 (PUMA constitution), the Reflexion loop should be repeatable in practice, although hosted LLM APIs do not guarantee identical outputs across runs even at temperature 0.

Implementation sketch for PUMA:

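def reflexion_triage(issue, ground_truth, triage_agent, reflector_llm, max_trials):
    """Retry triage on one labelled issue, accumulating verbal reflections."""
    reflections = []  # critiques from failed trials, passed to later trials
    result = None
    for _ in range(max_trials):
        result = triage_agent.run(issue, reflections=reflections)
        if result.is_correct():
            break
        # The reflector sees the gold label, so use this only on labelled
        # training/validation issues.
        reflection = reflector_llm.generate(
            trajectory=result.trajectory,
            correct_label=ground_truth,
        )
        reflections.append(reflection)
    return result, reflections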

PUMA Integration

Stage 4 (RAG Triage): Reflexion loop for iterative self-improvement on misclassified issues in the training/validation set
Stage 5 (SmartPMO): Manager agent can reflect on coordination failures and adjust sub-agent routing
Spec: SP-Triage-Agent — add Reflexion loop to triage agent spec
Experiment: EX-Stages-Overview — consider Reflexion as Stage 4 ablation

Related Notes

PN-ReAct-AgentPattern — ReAct is the base pattern Reflexion extends
PN-MultiAgent-ArchitecturePatterns — Reflexion as a single-agent self-improvement loop
PN-CoT-FewShot-Prompting — CoT is used inside the Actor
LN-Yao-2022-ReAct — base pattern
LN-Zelikman-2024-QuietSTaR — related self-improvement
