LN: Yao et al. (2022) — ReAct: Synergizing Reasoning and Acting

Bibliographic Reference

Citation: Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629. https://arxiv.org/abs/2210.03629

Venue: Presented at ICLR 2023.


Pass 1 — Bird’s Eye View (5 Cs)

Category: System proposal + empirical evaluation
Context: Builds on chain-of-thought prompting (CoT; Wei et al., 2022) and task-specific action spaces
Correctness: Rigorous evaluation on HotpotQA, FEVER, ALFWorld, WebShop
Contributions: (1) ReAct paradigm: interleaves Thought–Action–Observation cycles; (2) Outperforms Act-only prompting on all four benchmarks, and beats imitation and reinforcement learning baselines on ALFWorld and WebShop by 34 and 10 points of absolute success rate; (3) Reduces hallucination via grounding in external observations
Clarity: Excellent. Clear prompting examples.

Relevance: ⭐⭐⭐⭐⭐

Foundation paper for PUMA Stage 4–5 agent architecture.


Pass 2 — Content

Core Idea

ReAct = Reasoning + Acting. At each step an LLM alternates between:

  • Thought: internal reasoning about what to do
  • Action: calling an external tool (search, API, database)
  • Observation: reading the tool output and continuing

This loop continues until the agent emits a final answer. The key insight: because each Thought is conditioned on the preceding Observation, the agent’s next action is grounded in what it actually retrieved, not in what it “remembers,” which reduces hallucination.
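The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `name[argument]` action syntax is modeled on the paper's `search[...]` / `finish[...]` style, while `react_loop`, `scripted_llm`, and `toy_search` are hypothetical names, and the "LLM" is a scripted stand-in so the example runs without a model.

```python
import re
from typing import Callable, Dict, Optional

# Assumed action syntax: "name[argument]", e.g. "search[capital of France]".
ACTION_RE = re.compile(r"^(\w+)\[(.*)\]$")


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Run Thought -> Action -> Observation until finish[...] or max_steps."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model is expected to return "Thought: ...\nAction: name[arg]".
        output = llm(transcript)
        transcript += output.strip() + "\n"
        action_lines = [l for l in output.splitlines() if l.startswith("Action:")]
        if not action_lines:
            break  # no action produced: stop rather than guess
        match = ACTION_RE.match(action_lines[-1][len("Action:"):].strip())
        if match is None:
            break  # malformed action
        name, arg = match.groups()
        if name == "finish":
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown action: {name}"
        # The observation is appended so the next Thought can condition on it.
        transcript += f"Observation: {observation}\n"
    return None


def scripted_llm(transcript: str) -> str:
    # Stand-in for a real model call: picks a canned step by counting
    # how many observations the transcript already contains.
    steps = [
        "Thought: I need to look up the capital of France.\n"
        "Action: search[capital of France]",
        "Thought: The search result names Paris.\nAction: finish[Paris]",
    ]
    return steps[min(transcript.count("Observation:"), len(steps) - 1)]


def toy_search(query: str) -> str:
    # Stand-in for a real retrieval tool.
    return "Paris is the capital of France."


answer = react_loop(
    "What is the capital of France?", scripted_llm, {"search": toy_search}
)
```

The `max_steps` cap is a design choice of this sketch: without it, a model that never emits `finish[...]` would loop indefinitely.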

Key Results

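  • HotpotQA (exact match, PaLM-540B prompting): ReAct 27.4 vs. CoT 29.4 vs. Act-only 25.7; the best combination, ReAct → CoT-SC, reaches 35.1
  • FEVER (accuracy): ReAct 60.9 vs. CoT 56.3 vs. Act-only 58.9; the best combination, CoT-SC → ReAct, reaches 64.6
  • ALFWorld (overall success rate): best ReAct trial 71% vs. best Act-only trial 45% vs. BUTLER (imitation learning) 37%
  • WebShop (success rate): ReAct 40.0% vs. Act-only 30.1% vs. imitation learning 29.1%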

Failure Modes

  • Repetitive loops when the tool returns unexpected results
  • Inability to recover from an incorrect reasoning path (→ addressed by Reflexion, Shinn et al. 2023)

Pass 3 — Virtual Reconstruction

ReAct’s contribution is essentially a prompt format: interleaving reasoning and action in the few-shot examples is all that is needed to make standard LLMs exhibit agent-like behaviour. This is practically significant for PUMA: our agents can use ReAct-style prompting without any fine-tuning.
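To make the "prompt format" point concrete, the following sketch assembles a few-shot prompt from demonstration trajectories. The numbered `Thought i / Action i / Observation i` layout follows the style of the paper's HotpotQA prompts; the helper names (`format_demo`, `build_prompt`) and the demonstration content are illustrative assumptions, not taken from the paper or from PUMA.

```python
from typing import List, Tuple

# One step = (thought, action, observation); the final finish[...] step
# carries an empty observation.
Step = Tuple[str, str, str]
Demo = Tuple[str, List[Step]]


def format_demo(question: str, steps: List[Step]) -> str:
    """Render one demonstration trajectory as numbered ReAct lines."""
    lines = [f"Question: {question}"]
    for i, (thought, action, observation) in enumerate(steps, start=1):
        lines.append(f"Thought {i}: {thought}")
        lines.append(f"Action {i}: {action}")
        if observation:
            lines.append(f"Observation {i}: {observation}")
    return "\n".join(lines)


def build_prompt(demos: List[Demo], new_question: str) -> str:
    """Concatenate demonstrations, then open a new trajectory for the model."""
    blocks = [format_demo(q, s) for q, s in demos]
    blocks.append(f"Question: {new_question}\nThought 1:")
    return "\n\n".join(blocks)


# Made-up demonstration, for illustration only.
demo: Demo = (
    "Is the login timeout issue a bug or a feature request?",
    [
        ("I should look for similar past issues.",
         "search[login timeout]",
         "Two earlier issues about login timeouts were labeled bug."),
        ("Similar issues were bugs, so this one is too.",
         "finish[bug]",
         ""),
    ],
)
prompt = build_prompt([demo], "Is the export crash a bug or a feature request?")
```

The prompt ends with an open `Thought 1:` so that the model continues the trajectory in the same format as the demonstrations.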

Q1 (How can I use this?): ReAct is the base pattern for the PUMA Stage 4 triage agent. When the agent retrieves similar historical issues (the RAG step), the retrieval is the Action, the returned issues are the Observation, and the following Thought uses them to justify the classification decision.

Q2 (Does it do what it claims?): The HotpotQA benchmark involves factual QA with Wikipedia. The transfer to PM triage is non-trivial: the “observations” in PM would be retrieved Jira issues, not Wikipedia paragraphs, so the gains reported in the paper may not carry over.

Q3 (What if?): What if the LLM generates internally inconsistent Thought steps? PUMA could implement a simple consistency check: verify that the final label in the Answer step matches the reasoning in the Thought step.
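A very simple version of that consistency check could look like the sketch below. Everything here is an assumption for illustration: the label set is hypothetical, `label_in_text` and `is_consistent` are not existing PUMA functions, and the heuristic (take the label mentioned last in the final Thought) is deliberately naive.

```python
import re
from typing import Optional, Sequence

# Hypothetical label set; the real triage labels may differ.
LABELS = ["bug", "feature", "question"]


def label_in_text(text: str, labels: Sequence[str]) -> Optional[str]:
    """Return the label whose last whole-word mention comes latest in `text`."""
    lowered = text.lower()
    last_pos, last_label = -1, None
    for label in labels:
        pattern = rf"\b{re.escape(label.lower())}\b"
        for m in re.finditer(pattern, lowered):
            if m.start() > last_pos:
                last_pos, last_label = m.start(), label
    return last_label


def is_consistent(final_thought: str, answer: str, labels: Sequence[str]) -> bool:
    """True if the label the final Thought argues for equals the Answer label."""
    reasoned = label_in_text(final_thought, labels)
    return reasoned is not None and reasoned.lower() == answer.strip().lower()


ok = is_consistent("This matches earlier crash reports, so it is a bug.", "bug", LABELS)
mismatch = is_consistent(
    "The reporter asks for a new export option, so this is a feature request.",
    "bug",
    LABELS,
)
```

This heuristic does not handle negation ("not a bug") or Thoughts that mention no label at all (treated as inconsistent here), so it is better suited to flagging cases for review than to automatic rejection.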


PUMA Integration

MOCs