LN: Liu et al. (2023) — AgentBench: Evaluating LLMs as Agents

Bibliographic Reference

Citation: Liu, X., Yu, H., Zhang, H., et al. (2023). AgentBench: Evaluating LLMs as agents. arXiv:2308.03688. https://arxiv.org/abs/2308.03688
Venue: ICLR 2024.

Pass 1 — Bird’s Eye View (5 Cs)

C	Assessment
Category	Benchmark paper + empirical evaluation
Context	First comprehensive benchmark for LLMs acting as agents; responds to the proliferation of agent systems with no common evaluation framework
Correctness	8 environments, 25+ LLMs tested. Results are systematic and reproducible. Open-sourced on GitHub.
Contributions	(1) 8 distinct environments: OS, DB, KG, Digital Card Game, Lateral Thinking Puzzles, House-Holding (ALFWorld), Web Shopping (WebShop), Web Browsing (Mind2Web); (2) Standardized protocol for agent evaluation; (3) GPT-4 dramatically outperforms open-source models on real-world tasks; (4) Gap between chat and agent performance identified
Clarity	Excellent. Environment descriptions, scoring, and replication guide are clear.

Relevance: ⭐⭐⭐⭐⭐

PUMA uses Ollama-based open-source LLMs (Llama, Mistral). AgentBench’s finding that open-source models underperform GPT-4 on agentic tasks is a key threat to validity for PUMA.

Pass 2 — Content

The 8 Environments

Environment	Type	Task
OS	Code/Action	Bash scripting, file management
DB	Code/Action	SQL, database manipulation
KG	Reasoning	Knowledge graph QA
ALFWorld	Embodied	Text-based household tasks
WebShop	Web	E-commerce purchasing
Mind2Web	Web	Website navigation
Digital Card Game	Game	Turn-based strategy card game (Aquawar)
Lateral Thinking Puzzles	Game	Solving riddles through yes/no questions

Key Findings

Critical finding for PUMA

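Open-source LLMs (Llama, Vicuna) score 5–30% of GPT-4 on most AgentBench tasks. This suggests that PUMA’s local LLMs (Llama 3, Mistral) are likely to perform well below cloud models, a gap that must be addressed explicitly in the discussion chapter. The open-source models in the table below are from the Llama 2 generation, which predates Llama 3, so the size of the gap for PUMA’s own models remains to be measured.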

Model	Overall Score (approx., normalized to GPT-4 = 100)
GPT-4	~100
GPT-3.5	~52
Llama-2-70B	~18
Vicuna-33B	~14

Root Cause of Gap

Instruction following: open-source models frequently deviate from the expected output format
Long-context reasoning: degraded performance on tasks requiring multi-step reasoning over 4k+ tokens
Action grounding: models hallucinate actions not in the action space
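
The format and action-grounding failures listed above can be told apart mechanically. The following is a minimal sketch, assuming a hypothetical `Action: <name>` reply convention and an illustrative action set; neither is taken from AgentBench itself.

```python
import re

# Hypothetical action space, for illustration only.
ACTION_SPACE = {"search", "click", "answer"}

# Assumed reply convention: one line of the form "Action: <name>".
ACTION_PATTERN = re.compile(r"^Action:\s*(\w+)\s*$", re.MULTILINE)


def classify_reply(reply: str) -> str:
    """Return 'ok', 'invalid_format', or 'invalid_action' for one model reply."""
    match = ACTION_PATTERN.search(reply)
    if match is None:
        return "invalid_format"  # no line follows the expected pattern
    if match.group(1) not in ACTION_SPACE:
        return "invalid_action"  # well-formed, but names an unknown action
    return "ok"
```

Counting the two failure labels separately would show whether a local model mostly breaks the output format or mostly invents actions, which call for different mitigations (stricter prompts versus constrained decoding or retries).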

Pass 3 — Virtual Reconstruction

Q1 (What does AgentBench tell PUMA?): PUMA Stage 1–2 are “tool-augmented classification” tasks, not full agentic tasks. The OS/DB environments are harder than PM triage. However, the format-following failures are directly relevant — PUMA must use structured output prompts and validate JSON responses.
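
JSON validation of the kind Q1 calls for could look roughly like this. It is a sketch under assumptions: the field names `type`, `priority`, and `component` are hypothetical placeholders, not PUMA's actual schema.

```python
import json
from typing import Optional

# Hypothetical required fields; PUMA's real schema may differ.
REQUIRED_FIELDS = ("type", "priority", "component")


def parse_triage_response(raw: str) -> Optional[dict]:
    """Return the parsed response, or None if it is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all, e.g. prose wrapped around the object
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    if any(not isinstance(data.get(field), str) for field in REQUIRED_FIELDS):
        return None  # a required field is missing or not a string
    return data
```

Returning `None` instead of raising keeps the calling loop simple: a rejected response can be retried or logged as a format failure.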

Q2 (How does this affect H1/H2?): AgentBench establishes that open-source LLMs have systematic format-following weaknesses. PUMA should implement a successful parsing rate metric alongside F1/MAE to track how often the LLM returns parseable output.
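
One possible definition of the successful parsing rate is the fraction of raw model outputs that parse as a JSON object. The sketch below assumes that definition; the function name is illustrative.

```python
import json
from typing import Iterable


def successful_parsing_rate(raw_outputs: Iterable[str]) -> float:
    """Fraction of outputs that parse as a JSON object (0.0 for empty input)."""
    total = 0
    parsed = 0
    for raw in raw_outputs:
        total += 1
        try:
            if isinstance(json.loads(raw), dict):
                parsed += 1
        except json.JSONDecodeError:
            pass  # unparseable output counts against the rate
    return parsed / total if total else 0.0
```

If F1 and MAE are computed only over the outputs that parse, they will overstate real performance, so it seems safer to report them together with this rate.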

Q3 (Benchmarking PUMA): PUMA could contribute a PM-specific agent evaluation environment to the AgentBench ecosystem — issue triage as a structured action space (Type × Priority × Component classification).
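
The structured action space in Q3 can be sketched as a Cartesian product of label sets. All labels below are hypothetical placeholders, not PUMA's actual taxonomy.

```python
from enum import Enum
from itertools import product


# Hypothetical label sets, for illustration only.
class IssueType(Enum):
    BUG = "bug"
    FEATURE = "feature"
    TASK = "task"


class Priority(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


COMPONENTS = ("frontend", "backend")

# The action space is the product Type x Priority x Component.
ACTION_SPACE = frozenset(
    (t.value, p.value, c) for t, p, c in product(IssueType, Priority, COMPONENTS)
)


def is_valid_action(issue_type: str, priority: str, component: str) -> bool:
    """Check that a predicted triple lies inside the action space."""
    return (issue_type, priority, component) in ACTION_SPACE
```

With these placeholder labels the space has 3 × 3 × 2 = 18 actions. Any triple outside it would count as a hallucinated action in AgentBench's sense.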

PUMA Integration

Threat to validity: AgentBench’s open-source capability gap must be discussed in Ch.5 Discussion
Metrics: Add “Successful Parsing Rate” metric to EX-Stages-Overview
Baseline choice: Justifies using GPT-4o as a cloud comparison baseline in PUMA Stage 1
Spec: SP-Triage-Agent — output format validation required

Related Notes

PN-KeyConcepts-Agents-Reproducibility-RedTeam: AgentBench as evaluation framework
PN-MultiAgent-ArchitecturePatterns: environment taxonomy
LN-Jimenez-2023-SWEbench: complementary coding benchmark
LN-Mialon-2023-GAIA: GAIA general benchmark
LN-Masterman-2024-AgentArchSurvey: architecture survey
